
cp: fix: transformers v5.5.0 validation (2010) into r0.4.0 #2013

Merged: akoumpa merged 1 commit into r0.4.0 from cherry-pick-2010-r0.4.0 on Apr 23, 2026
Conversation

@svcnvidia-nemo-ci (Contributor)

beep boop [🤖]: Hi @akoumpa 👋,

we've cherry picked #2010 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

* catch StrictDataclassClassValidationError

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* test: limit nightly to stepfun recipe for stepfun CI repro

Temporarily prune nightly_recipes.yml to the single
stepfun/step_3.5_flash_hellaswag_pp.yaml recipe to iterate on the
StrictDataclassClassValidationError fix without paying for the full
nightly matrix. Not intended for merge.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: retry NeMoAutoTokenizer load when config trips layer_types validator

AutoTokenizer.from_pretrained internally calls AutoConfig.from_pretrained
to resolve the tokenizer class. For checkpoints whose config has
layer_types longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash),
newer transformers rejects the config and huggingface_hub wraps the
ValueError in StrictDataclassClassValidationError (not a ValueError
subclass). The previous get_hf_config fix only covered the model-load
path; the tokenizer path hit the same failure independently.

On that specific validator failure, preload a config via get_hf_config
(which truncates layer_types) and retry the tokenizer load with an
explicit config=, bypassing the internal AutoConfig call.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
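The catch-and-retry shape this commit describes can be sketched as below. This is a hypothetical, self-contained illustration: `StrictDataclassClassValidationError` here is a stand-in class (the real one lives in huggingface_hub and is deliberately not a `ValueError` subclass), and the loader and config callables are injected rather than being the real `AutoTokenizer`/`get_hf_config` call sites.

```python
class StrictDataclassClassValidationError(Exception):
    """Stand-in for huggingface_hub's strict-dataclass validation error."""


def load_tokenizer_with_retry(name, from_pretrained, get_hf_config, **kwargs):
    """Retry a tokenizer load with an explicit config= on validator failure."""
    try:
        # First attempt: the loader resolves its config internally, which is
        # where the layer_types validator can fire.
        return from_pretrained(name, **kwargs)
    except StrictDataclassClassValidationError:
        # Preload a config (the real get_hf_config truncates layer_types),
        # then bypass the internal AutoConfig call by passing config=.
        config = get_hf_config(name)
        return from_pretrained(name, config=config, **kwargs)
```

The key point is that the `except` clause names the wrapper exception explicitly, since catching `ValueError` alone would miss it.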

* refactor: relax validate_layer_type globally instead of preloading config

The previous tokenizer retry preloaded a fixed config via get_hf_config
and re-entered AutoTokenizer.from_pretrained with an explicit config=.
That round-trip is brittle (reconstructs a config the tokenizer does not
use) and only fixes the tokenizer call site.

Replace it with relax_layer_types_validator(): a one-shot monkey-patch
that swaps PretrainedConfig.validate_layer_type with a no-op and rewrites
the already-frozen validator entries in every live subclass's
__class_validators__ list. After that, any downstream call that instantiates
a config with mismatched layer_types/num_hidden_layers skips the check.

The tokenizer retry now just applies the patch and re-invokes
super().from_pretrained(...) with the original kwargs.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
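A minimal sketch of the one-shot patch described above, using stand-in classes: `BaseConfig` mimics `PretrainedConfig`, and `__class_validators__` mimics the list huggingface_hub freezes at class-creation time (which is why patching the base method alone is not enough). Names and structure are illustrative, not the real transformers internals.

```python
class BaseConfig:
    """Stand-in for PretrainedConfig with a strict layer_types validator."""

    def validate_layer_type(self, layer_types, num_hidden_layers):
        if layer_types is not None and len(layer_types) != num_hidden_layers:
            raise ValueError("layer_types length mismatch")


class SubConfig(BaseConfig):
    # Mimics huggingface_hub freezing the validator reference at class
    # creation: this list keeps pointing at the *original* function even
    # after the base attribute is reassigned.
    __class_validators__ = [BaseConfig.validate_layer_type]


_patched = False


def relax_layer_types_validator():
    """One-shot patch: no-op the base validator and rewrite frozen copies."""
    global _patched
    if _patched:
        return
    _patched = True
    original = BaseConfig.validate_layer_type
    noop = lambda self, *args, **kwargs: None
    BaseConfig.validate_layer_type = noop
    # Direct subclasses only, for brevity; the frozen entries in each live
    # subclass's validator list are swapped for the no-op.
    for subclass in BaseConfig.__subclasses__():
        validators = getattr(subclass, "__class_validators__", None)
        if validators:
            subclass.__class_validators__ = [
                noop if v is original else v for v in validators
            ]
```

After the patch, any downstream instantiation that would have tripped the mismatch check simply skips it, which is exactly why the tokenizer retry can re-invoke `super().from_pretrained(...)` with the original kwargs.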

* fix: retry VLM AutoProcessor load on layer_types validation failure

AutoProcessor.from_pretrained internally loads AutoConfig, so configs
whose layer_types length differs from num_hidden_layers trip
validate_layer_type through the processor path too. Previously the
VLM build_dataloader caught the error under a broad except and silently
set processor=None, producing a cryptic downstream failure.

On the specific validator signature, call relax_layer_types_validator()
and retry AutoProcessor.from_pretrained once. Unrelated exceptions keep
the original fall-through to processor=None with a warning. LLM tokenizer
path is already covered via NeMoAutoTokenizer.

Also pass --force-exclude to the ruff pre-commit hooks so the
tests/ exclusion already declared in pyproject.toml takes effect when
pre-commit passes files explicitly.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
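The processor-path retry described above follows the same pattern: retry once on the specific validator failure, and keep the original warn-and-fall-through for anything else. A hedged, self-contained sketch, with `is_layer_types_failure` and `relax_validator` injected as stand-ins for the real NeMo helpers:

```python
import warnings


def load_processor_with_retry(name, from_pretrained, relax_validator,
                              is_layer_types_failure):
    """Retry a processor load once after relaxing the layer_types check."""
    try:
        return from_pretrained(name)
    except Exception as exc:
        if is_layer_types_failure(exc):
            # Specific validator failure: relax the check and retry once.
            relax_validator()
            return from_pretrained(name)
        # Unrelated failures keep the original fall-through to None,
        # but now with an explicit warning instead of a silent swallow.
        warnings.warn(f"could not load processor for {name}: {exc}")
        return None
```

Matching on the exception's signature rather than `except Exception` alone is what turns the previously cryptic `processor=None` downstream failure into either a successful retry or a visible warning.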

* revert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* revert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fmt

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci (Contributor, Author)

/ok to test c748500


copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa merged commit 689e408 into r0.4.0 Apr 23, 2026
53 checks passed
@akoumpa akoumpa deleted the cherry-pick-2010-r0.4.0 branch April 23, 2026 06:14

Labels

cherry-pick, Run CICD, Trigger Testing CICD
